Abstract

A variable screening procedure via correlation learning was proposed 
in Fan and Lv (2008) to reduce dimensionality in sparse ultra-high di- 
mensional models. Even when the true model is linear, the marginal 
regression can be highly nonlinear. To address this issue, we further ex- 
tend the correlation learning to marginal nonparametric learning. Our 
nonparametric independence screening is called NIS, a specific mem- 
ber of the sure independence screening. Several closely related variable 
screening procedures are proposed. Under general nonparametric mod- 
els, it is shown that under some mild technical conditions, the proposed 
independence screening methods enjoy a sure screening property. The 
extent to which the dimensionality can be reduced by independence
screening is also explicitly quantified. As a methodological extension, a
data-driven thresholding and an iterative nonparametric independence 
screening (INIS) are also proposed to enhance the finite sample perfor- 
mance for fitting sparse additive models. The simulation results and a 
real data analysis demonstrate that the proposed procedure works well 
with moderate sample size and large dimension and performs better 
than competing methods. 

Keywords: Additive model, independent learning, nonparametric regression, 
sparsity, sure independence screening, nonparametric independence screening, 
variable selection. 



1 Introduction 

With rapid advances of computing power and other modern technology, high- 
throughput data of unprecedented size and complexity are frequently seen in 
many contemporary statistical studies. Examples include data from genetic, 
microarrays, proteomics, fMRI, functional data and high frequency financial 
data. In all these examples, the number of variables p can grow much faster 
than the number of observations n. To be more specific, we assume
$\log p = O(n^a)$ for some $a \in (0, 1/2)$. Following Fan and Lv (2009), we call it
non-polynomial (NP) dimensionality or ultra-high dimensionality. What makes
the under-determined statistical inference possible is the sparsity assumption: 
only a small set of independent variables contribute to the response. Therefore, 
dimension reduction and feature selection play pivotal roles in these ultra-high 
dimensional problems. 

The statistical literature contains numerous procedures on the variable
selection for linear models and other parametric models, such as the Lasso
(Tibshirani, 1996), the SCAD and other folded-concave penalty (Fan, 1997;
Fan and Li, 2001), the Dantzig selector (Candes and Tao, 2007), the Elastic
net (Enet) penalty (Zou and Hastie, 2005), the MCP (Zhang, 2010) and related
methods (Zou, 2006; Zou and Li, 2008). Nevertheless, due to the "curse
of dimensionality" in terms of simultaneous challenges on the computational
expediency, statistical accuracy and algorithmic stability, these methods meet
their limits in ultra-high dimensional problems.



Motivated by these concerns, Fan and Lv (2008) introduced a new framework
for variable screening via correlation learning with NP-dimensionality in
the context of least squares. Hall et al. (2009) used a different marginal
utility, derived from an empirical likelihood point of view. Hall and Miller
(2009) proposed a generalized correlation ranking, which allows nonlinear
regression. Huang et al. (2008) also investigated the marginal bridge
regression in the ordinary linear model. These methods focus on studying the
marginal pseudo-likelihood and are fast but crude in terms of reducing the
NP-dimensionality to a more moderate size. To enhance the performance,
Fan and Lv (2008) and Fan et al. (2009) introduced some methodological
extensions including iterative
SIS (ISIS) and multi-stage procedures, such as SIS-SCAD and SIS-LASSO,
to select variables and estimate parameters simultaneously. Nevertheless, these 
marginal screening methods have some methodological challenges. When the 
covariates are not jointly normal, even if the linear model holds in the joint 
regression, the marginal regression can be highly nonlinear. Therefore, sure 
screening based on nonparametric marginal regression becomes a natural can- 
didate. 

In practice, there is often little prior information that the effects of the 
covariates take a linear form or belong to any other finite-dimensional
parametric family. Substantial improvements are sometimes possible by using
a more flexible class of nonparametric models, such as the additive model
$Y = \sum_{j=1}^{p} m_j(X_j) + \varepsilon$, introduced by Stone (1985). It increases substantially
the flexibility of the ordinary linear model and allows a data-analytic transform
of the covariates to enter into the linear model. Yet, the literature on variable
selection in nonparametric additive models is limited. See, for example,
Koltchinskii and Yuan (2008), Ravikumar et al. (2009), Huang et al. (2010) and
Meier et al. (2009). Koltchinskii and Yuan (2008) and Ravikumar et al. (2009)
are closely related to COSSO proposed in Lin and Zhang (2006) with
fixed minimal signals, which do not converge to zero. Huang et al. (2010)
can be viewed as an extension of adaptive lasso to additive models with fixed
minimal signals. Meier et al. (2009) proposed a penalty which is a combination
of sparsity and smoothness with a fixed design. Under ultra-high dimensional
settings, all these methods still suffer from the aforementioned three challenges
as they can be viewed as extensions of penalized pseudo-likelihood approaches
to additive modeling. The commonly used algorithm in additive modeling
such as backfitting makes the situation even more challenging, as it is quite
computationally expensive.

In this paper, we consider independence learning by ranking the magnitude 
of marginal estimators, nonparametric marginal correlations, and the marginal 
residual sum of squares. That is, we fit p marginal nonparametric regressions
of the response Y against each covariate $X_i$ separately and rank their
importance to the joint model according to a measure of the goodness of fit of
their marginal model. The magnitude of these marginal utilities can preserve
the non-sparsity of the joint additive models under some reasonable
conditions, even with converging minimum strength of signals. Our work can be
regarded as an important and nontrivial extension of SIS procedures proposed
in Fan and Lv (2008) and Fan and Song (2010). Compared with these papers,



the minimum distinguishable signal is related with not only the stochastic error 
in estimating the nonparametric components, but also approximation errors 
in modeling nonparametric components, which depends on the number of ba- 
sis functions used for the approximation. This brings significant challenges to 
the theoretical development and leads to an interesting result on the extent 
to which the dimensionality can be reduced by nonparametric independence 
screening. We also propose an iterative nonparametric independence screen- 
ing procedure, INIS-penGAM, to reduce the false positive rate and stabilize 
the computation. This two-stage procedure can deal with the aforementioned 
three challenges better than other methods, as will be demonstrated in our 
empirical studies. 

We approximate the nonparametric additive components by using a B- 
spline basis. Hence, the component selection in additive models can be viewed 
as a functional version of the grouped variable selection. An early
reference on group variable selection using group penalized least-squares is
Antoniadis and Fan (2001) (see page 966), in which blocks of wavelet
coefficients are either killed or selected. The group variable selection was more
thoroughly studied in Yuan and Lin (2006), Kim et al. (2006), Wei and Huang
(2007) and Meier et al. (2009). Our methods and results have important
implications on the group variable selections, as in additive regression, each
component can be expressed as a linear combination of a set of basis functions,
whose coefficients have to be either killed or selected simultaneously.

The rest of the paper is organized as follows. In Section 2, we introduce 
the nonparametric independence screening (NIS) procedure in additive models. 
The theoretical properties for NIS are presented in Section 3. As a
methodological extension, INIS-penGAM and its greedy version g-INIS-penGAM are
outlined in Section 4. Monte Carlo simulations and a real data analysis in Sec- 
tion 5 demonstrate the effectiveness of the INIS method. We conclude with a 
discussion in Section 6 and relegate the proofs to Section 7. 

2 Nonparametric independence screening 

Suppose that we have a random sample $\{(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$ from the population

$$Y = m(\mathbf{X}) + \varepsilon, \qquad (1)$$

in which $\mathbf{X} = (X_1, \ldots, X_p)^T$ and $\varepsilon$ is the random error with conditional mean
zero. To expeditiously identify important variables in model (1), without the
"curse-of-dimensionality", we consider the following p marginal nonparametric
regression problems:

$$\min_{f_j \in L_2(P)} E\,(Y - f_j(X_j))^2, \qquad (2)$$

where P denotes the joint distribution of $(\mathbf{X}, Y)$ and $L_2(P)$ is the class of
square integrable functions under the measure P. The minimizer of (2) is
$f_j = E(Y \mid X_j)$, the projection of Y onto $X_j$. We rank the utility of covariates
in model (1) according to, for example, $E f_j^2(X_j)$ and select a small group of
covariates via thresholding.

To obtain a sample version of the marginal nonparametric regression, we
employ a B-spline basis. Let $S_n$ be the space of polynomial splines of degree
$l \ge 1$ and $\{\Psi_{jk}, k = 1, \ldots, d_n\}$ denote a normalized B-spline basis with
$\|\Psi_{jk}\|_\infty \le 1$, where $\|\cdot\|_\infty$ is the sup norm. For any $f_{nj} \in S_n$, we have

$$f_{nj}(x) = \sum_{k=1}^{d_n} \beta_{jk} \Psi_{jk}(x), \qquad 1 \le j \le p,$$

for some coefficients $\{\beta_{jk}\}_{k=1}^{d_n}$. Under some smoothness conditions, the
nonparametric projections $\{f_j\}_{j=1}^{p}$ can be well approximated by functions in $S_n$.
The sample version of the marginal regression problem can be expressed as

$$\min_{f_{nj} \in S_n} P_n (Y - f_{nj}(X_j))^2 = \min_{\beta_j \in \mathbb{R}^{d_n}} P_n (Y - \Psi_j^T \beta_j)^2, \qquad (3)$$

where $\Psi_j \equiv \Psi_j(X_j) = (\Psi_{j1}(X_j), \ldots, \Psi_{j d_n}(X_j))^T$ denotes the $d_n$ dimensional
basis functions and $P_n g(\mathbf{X}, Y)$ is the expectation with respect to the
empirical measure $P_n$, i.e., the sample average of $\{g(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$. This univariate
nonparametric smoothing can be rapidly computed, even for NP-dimensional
problems. We correspondingly define the population version of the minimizer
of the componentwise least squares regression,

$$f_{nj}(X_j) = \Psi_j^T (E \Psi_j \Psi_j^T)^{-1} E \Psi_j Y, \qquad j = 1, \ldots, p,$$

where E denotes the expectation under the true model.
We now select a set of variables

$$\widehat{\mathcal{M}}_{\nu_n} = \{1 \le j \le p : \|\hat f_{nj}\|_n^2 \ge \nu_n\}, \qquad (4)$$

where $\hat f_{nj}$ is the minimizer of (3), $\|\hat f_{nj}\|_n^2 = n^{-1} \sum_{i=1}^{n} \hat f_{nj}(X_{ij})^2$, and $\nu_n$ is a
predefined threshold value. Such an independence screening ranks the
importance according to the marginal strength of the marginal nonparametric
regression. This screening can also be viewed as ranking by the magnitude of the
correlation of the marginal nonparametric estimate $\{\hat f_{nj}(X_{ij})\}_{i=1}^{n}$ with the
response $\{Y_i\}_{i=1}^{n}$, since $\|\hat f_{nj}\|_n^2 = P_n Y \hat f_{nj}$. In this sense, the proposed NIS
procedure is related to the correlation learning proposed in Fan and Lv (2008).

Another screening approach is to rank according to the descent order of
the residual sum of squares of the componentwise nonparametric regressions,
where we select a set of variables:

$$\widehat{\mathcal{N}}_{\gamma_n} = \{1 \le j \le p : u_j \le \gamma_n\},$$

where $u_j = \min_{\beta_j} P_n (Y - \Psi_j^T \beta_j)^2$ is the residual sum of squares of the marginal
fit and $\gamma_n$ is a predefined threshold value. It is straightforward to show that
$u_j = P_n (Y^2 - \hat f_{nj}^2)$. Hence, the two methods are equivalent.
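The ranking in (4) and its equivalence with the residual-sum-of-squares ranking can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: to keep it dependency-free we replace the B-spline basis by a piecewise-constant basis (indicators of equal-width bins), for which the least squares fit is simply the bin mean. The function names `marginal_utility` and `nis_select` and the choice of five bins are our own assumptions.

```python
import random

def marginal_utility(x, y, n_bins=5):
    """Return (||f_hat||_n^2, RSS / n) for a piecewise-constant fit of y on x.

    Assumption: bin indicators stand in for the B-spline basis of the paper.
    """
    n = len(x)
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant covariate
    sums, counts, bins = [0.0] * n_bins, [0] * n_bins, []
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)  # right endpoint joins last bin
        bins.append(b)
        sums[b] += yi
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    fitted = [means[b] for b in bins]  # least squares fit = bin means
    norm2 = sum(f * f for f in fitted) / n
    rss = sum((yi - f) ** 2 for yi, f in zip(y, fitted)) / n
    return norm2, rss

def nis_select(columns, y, threshold, n_bins=5):
    """Indices j whose marginal utility ||f_hat_nj||_n^2 reaches the threshold."""
    return [j for j, col in enumerate(columns)
            if marginal_utility(col, y, n_bins)[0] >= threshold]

# Toy data: only the first covariate matters, and its effect is nonlinear.
random.seed(0)
n, p = 400, 20
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
raw = [columns[0][i] ** 2 + 0.5 * random.gauss(0, 1) for i in range(n)]
y_bar = sum(raw) / n
y = [v - y_bar for v in raw]  # centre the response, as assumed in Section 3

utilities = [marginal_utility(col, y)[0] for col in columns]
ranking = sorted(range(p), key=lambda j: utilities[j], reverse=True)
mean_y2 = sum(v * v for v in y) / n
```

In this toy example the quadratic effect is invisible to a linear correlation but is picked up by the nonparametric marginal fit, and the identity $u_j = P_n(Y^2 - \hat f_{nj}^2)$ holds for every column because the fit is a least squares projection.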

The nonparametric independence screening reduces the dimensionality from
p to a possibly much smaller space with model size $|\widehat{\mathcal{M}}_{\nu_n}|$ or $|\widehat{\mathcal{N}}_{\gamma_n}|$. It is
applicable to all models. The question is whether we have mistakenly deleted some
active variables in model (1). In other words, does the procedure have the sure
screening property postulated by Fan and Lv (2008)? In the next section,
we will show that the sure screening property indeed holds for nonparametric 
additive models with a limited false selection rate. 



3 Sure Screening Properties 

In this section, we establish the sure screening properties for additive models 
with results presented in three steps. 
3.1 Preliminaries

We now assume that the true regression function admits the additive structure:

$$m(\mathbf{X}) = \sum_{j=1}^{p} m_j(X_j). \qquad (5)$$

For identifiability, we assume $\{m_j(X_j)\}_{j=1}^{p}$ have mean zero. Consequently, the
response Y has zero mean, too. Let $\mathcal{M}_* = \{j : E m_j(X_j)^2 > 0\}$ be the true
sparse model with non-sparsity size $s_n = |\mathcal{M}_*|$. We allow p to grow with n
and denote it as $p_n$ whenever needed.

The theoretical basis of the sure screening is that the marginal signal of
the active components ($\|f_j\|$, $j \in \mathcal{M}_*$) does not vanish, where $\|f_j\|^2 = E f_j^2$.

The following conditions make this possible. For simplicity, let [a, b] be the
support of $X_j$.

A. The nonparametric marginal projections $\{f_j\}_{j=1}^{p}$ belong to a class of
functions $\mathcal{F}$ whose rth derivative $f^{(r)}$ exists and is Lipschitz of order $\alpha$:

$$\mathcal{F} = \{ f(\cdot) : |f^{(r)}(s) - f^{(r)}(t)| \le K |s - t|^{\alpha}, \text{ for } s, t \in [a, b] \},$$

for some positive constant K, where r is a non-negative integer and
$\alpha \in (0, 1]$ such that $d = r + \alpha > 0.5$.

B. The marginal density function $g_j$ of $X_j$ satisfies $0 < K_1 \le g_j(X_j) \le
K_2 < \infty$ on [a, b] for $1 \le j \le p$ for some constants $K_1$ and $K_2$.

C. $\min_{j \in \mathcal{M}_*} E\{E(Y \mid X_j)^2\} \ge c_1 d_n n^{-2\kappa}$, for some $0 < \kappa < d/(2d + 1)$ and
$c_1 > 0$.
Under conditions A and B, the following three facts hold when $l \ge d$ and will
be used in the paper. We state them here for readability.

Fact 1. There exists a positive constant $C_1$ such that (Stone, 1985)

$$\|f_j - f_{nj}\|^2 \le C_1 d_n^{-2d}. \qquad (6)$$



Fact 2. There exists a positive constant $C_2$ such that (Stone, 1985; Huang et al.,
2010)

$$E \Psi_{jk}^2(X_{ij}) \le C_2 d_n^{-1}. \qquad (7)$$



Fact 3. There exist some positive constants $D_1$ and $D_2$ such that (Zhou et al.,
1998)

$$D_1 d_n^{-1} \le \lambda_{\min}(E \Psi_j \Psi_j^T) \le \lambda_{\max}(E \Psi_j \Psi_j^T) \le D_2 d_n^{-1}. \qquad (8)$$

The following lemma shows that the minimum signal of $\{\|f_{nj}\|\}_{j \in \mathcal{M}_*}$ is at
the same level as the marginal projection, provided that the approximation
error is negligible.

Lemma 1. Under conditions A-C, we have

$$\min_{j \in \mathcal{M}_*} \|f_{nj}\|^2 \ge c_1 \xi d_n n^{-2\kappa},$$

provided that $d_n^{-2d-1} \le c_1 (1 - \xi) n^{-2\kappa} / C_1$ for some $\xi \in (0, 1)$.
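The role of the proviso in Lemma 1 can be seen from a short calculation. The sketch below assumes that Fact 1 takes the form $\|f_j - f_{nj}\|^2 \le C_1 d_n^{-2d}$ and uses the fact that $f_{nj}$ is the $L_2$ projection of $f_j$ onto the spline space, so that the Pythagorean identity applies; it is our reading of the argument, not a restatement of the authors' proof. For any $j \in \mathcal{M}_*$:

```latex
\begin{align*}
\|f_{nj}\|^2
  &= \|f_j\|^2 - \|f_j - f_{nj}\|^2
     && \text{(orthogonality of the projection)} \\
  &\ge c_1 d_n n^{-2\kappa} - C_1 d_n^{-2d}
     && \text{(Condition C and Fact 1)} \\
  &= c_1 d_n n^{-2\kappa} - C_1 d_n \cdot d_n^{-2d-1} \\
  &\ge c_1 d_n n^{-2\kappa} - c_1 (1 - \xi) d_n n^{-2\kappa}
     && \text{(since } d_n^{-2d-1} \le c_1 (1 - \xi) n^{-2\kappa} / C_1 \text{)} \\
  &= c_1 \xi \, d_n n^{-2\kappa}.
\end{align*}
```

In words, the proviso forces the spline approximation error to be at most a fraction $1 - \xi$ of the minimum marginal signal, so a fraction $\xi$ of the signal survives the approximation.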

A model selection consistency result can be established with nonparametric
independence screening under the partial orthogonality condition, i.e.,
$\{X_j, j \notin \mathcal{M}_*\}$ is independent of $\{X_i, i \in \mathcal{M}_*\}$. In this case, there is a
separation between the strength of marginal signals $\|f_{nj}\|^2$ for active variables
$\{X_j, j \in \mathcal{M}_*\}$ and inactive variables $\{X_j, j \notin \mathcal{M}_*\}$, which are zero. When
the separation is sufficiently large, these two sets of variables can be easily
identified.

3.2 Sure Screening 

In this section, we establish the sure screening properties of the nonparametric 
independence screening (NIS). We need the following additional conditions: 

D. $\|m\|_\infty \le B_1$ for some positive constant $B_1$, where $\|\cdot\|_\infty$ is the sup norm.

E. The random errors $\{\varepsilon_i\}_{i=1}^{n}$ are i.i.d. with conditional mean zero and for
any $B_2 > 0$, there exists a positive constant $B_3$ such that
$E[\exp(B_2 |\varepsilon_i|) \mid \mathbf{X}_i] < B_3$.

F. There exist a positive constant $c_1$ and $\xi \in (0, 1)$ such that
$d_n^{-2d-1} \le c_1 (1 - \xi) n^{-2\kappa} / C_1$.


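The following theorem gives the sure screening properties. It reveals that
it is only the size of non-sparse elements $s_n$ that matters for the purpose of
sure screening, not the dimensionality $p_n$. The first result is on the uniform
convergence of $\|\hat f_{nj}\|_n^2$ to $\|f_{nj}\|^2$.

Theorem 1. Suppose that Conditions A, B, D and E hold.
(i) For any $c_2 > 0$, there exist some positive constants $c_3$ and $c_4$ such that

$$P\Big( \max_{1 \le j \le p_n} \big| \|\hat f_{nj}\|_n^2 - \|f_{nj}\|^2 \big| \ge c_2 d_n n^{-2\kappa} \Big)
\le p_n d_n \big\{ (8 + 2 d_n) \exp(-c_3 n^{1-4\kappa} d_n^{-3}) + 6 d_n \exp(-c_4 n d_n^{-3}) \big\}. \qquad (9)$$

(ii) If, in addition, Conditions C and F hold, then by taking $\nu_n = c_5 d_n n^{-2\kappa}$
with $c_5 \le c_1 \xi / 2$, we have

$$P(\mathcal{M}_* \subset \widehat{\mathcal{M}}_{\nu_n}) \ge 1 - s_n d_n \big\{ (8 + 2 d_n) \exp(-c_3 n^{1-4\kappa} d_n^{-3}) + 6 d_n \exp(-c_4 n d_n^{-3}) \big\}.$$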

Note that the second part of the upper bound in Theorem 1 is related to the
uniform convergence rates of the minimum eigenvalues of the design matrices.
It gives an upper bound on the number of basis functions, $d_n = o(n^{1/3})$, in order
to have the sure screening property, whereas Condition F requires
$d_n \ge B_4 n^{2\kappa/(2d+1)}$, where $B_4 = (c_1 (1 - \xi) / C_1)^{-1/(2d+1)}$.

It follows from Theorem 1 that we can handle the NP-dimensionality:

$$\log p_n = o(n^{1-4\kappa} d_n^{-3} + n d_n^{-3}). \qquad (10)$$

Under this condition,

$$P(\mathcal{M}_* \subset \widehat{\mathcal{M}}_{\nu_n}) \to 1,$$

i.e., the sure screening property holds. It is worthwhile to point out that the
number of spline basis functions $d_n$ affects the order of dimensionality, in
contrast to the results of Fan and Lv (2008) and Fan and Song (2010), in which
univariate marginal regression is used. Equation (10) shows that the larger the
minimum signal level or the smaller the number of basis functions, the higher
dimensionality the nonparametric independence screening (NIS) can handle. This
is in line with our intuition. On the other hand, the number of basis functions
cannot be too small, since the approximation error cannot be too large. As
required by Condition F, $d_n \ge B_4 n^{2\kappa/(2d+1)}$; the smoother the underlying function,
the smaller $d_n$ we can take and the higher the dimension that the NIS can
handle. If the minimum signal does not converge to zero, as in
Lin and Zhang (2006), Koltchinskii and Yuan (2008) and Huang et al. (2010),
then $\kappa = 0$.
In this case, $d_n$ can be taken to be finite as long as it is sufficiently large so
that the minimum signal in Lemma 1 exceeds the noise level. By taking
$d_n = n^{1/(2d+1)}$, the optimal rate for nonparametric regression (Stone, 1985), we have
$\log p_n = o(n^{2(d-1)/(2d+1)})$. In other words, the dimensionality can be as high as
$\exp\{o(n^{2(d-1)/(2d+1)})\}$.
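The exponents in this discussion are easy to check numerically. The helper functions below are a small sketch of the arithmetic only, assuming that equation (10) reads $\log p_n = o(n^{1-4\kappa} d_n^{-3} + n d_n^{-3})$ and that Condition F gives $d_n \ge B_4 n^{2\kappa/(2d+1)}$; the function names are ours.

```python
import math

def log_p_exponent(d):
    """Exponent a with log p_n = o(n^a) when kappa = 0 and d_n = n^(1/(2d+1))."""
    return 2.0 * (d - 1.0) / (2.0 * d + 1.0)

def min_basis_exponent(kappa, d):
    """Exponent of n in the lower bound on d_n implied by Condition F."""
    return 2.0 * kappa / (2.0 * d + 1.0)

def rate_terms(n, kappa, d_n):
    """The two terms n^(1-4*kappa) * d_n^-3 and n * d_n^-3 of equation (10)."""
    return n ** (1.0 - 4.0 * kappa) / d_n ** 3, n / d_n ** 3

# With smoothness d = 2 and kappa = 0, take the optimal d_n = n^(1/5).
n, d = 1.0e6, 2.0
d_n = n ** (1.0 / (2.0 * d + 1.0))
first, second = rate_terms(n, 0.0, d_n)
```

For $d = 2$ the admissible growth is $\log p_n = o(n^{0.4})$, and both terms of (10) coincide when $\kappa = 0$; smoother functions (larger $d$) push the exponent towards one.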



3.3 Controlling false selection rates 

The sure screening property, without controlling false selection rates, is not
insightful. It basically states that the NIS has no false negatives. An ideal
case for the vanishing false positive rate is that

$$\max_{j \notin \mathcal{M}_*} \|f_{nj}\|^2 = o(d_n n^{-2\kappa}),$$

so that there is a gap between active variables and inactive variables in model
(1) when using the marginal nonparametric screener. In this case, by Theorem
1(i), if (9) tends to zero, then with probability tending to one

$$\max_{j \notin \mathcal{M}_*} \|\hat f_{nj}\|_n^2 \le c_2 d_n n^{-2\kappa}, \quad \text{for any } c_2 > 0.$$

Hence, by the choice of $\nu_n$ as in Theorem 1(ii), we can achieve model selection
consistency:

$$P(\widehat{\mathcal{M}}_{\nu_n} = \mathcal{M}_*) = 1 - o(1).$$

We now deal with the more general case. The idea is to bound the size
of the selected set by using the fact that var(Y) is bounded. In this part, we
show that the correlations among the basis functions, i.e., the design matrix
of the basis functions, are related to the size of selected models.

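Theorem 2. Suppose Conditions A-F hold and var(Y) = O(1). Then, for
any $\nu_n = c_5 d_n n^{-2\kappa}$, there exist positive constants $c_3$ and $c_4$ such that

$$P\big[ |\widehat{\mathcal{M}}_{\nu_n}| \le O\{n^{2\kappa} \lambda_{\max}(\Sigma)\} \big]
\ge 1 - p_n d_n \big\{ (8 + 2 d_n) \exp(-c_3 n^{1-4\kappa} d_n^{-3}) + 6 d_n \exp(-c_4 n d_n^{-3}) \big\},$$

where $\Sigma = E \Psi \Psi^T$ and $\Psi = (\Psi_1^T, \ldots, \Psi_{p_n}^T)^T$.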



The significance of the result is that when $\lambda_{\max}(\Sigma) = O(n^{\tau})$, the
selected model size with the sure screening property is only of polynomial order,
whereas the original model size is of NP-dimensionality. In other words, the
false selection rate converges to zero exponentially fast. The size of the
selected variables is of order $O(n^{2\kappa + \tau})$. This is of the same order as in Fan and
Lv (2008). Our result is an extension of Fan and Lv (2008), even in this very
specific case without the condition $2\kappa + \tau < 1$. The results are also consistent
with that in Fan and Song (2010): the number of selected variables is related
to the correlation structure of the covariance matrix.

In the specific case where the covariates are independent, the matrix $\Sigma$
is block diagonal with j-th block $\Sigma_j$. Hence, it follows from (8) that
$\lambda_{\max}(\Sigma) = O(d_n^{-1})$.

4 INIS Method



4.1 Description of the Algorithm 

After variable screening, the next step is naturally to select the variables using
more refined techniques in the additive model. For example, the penalized
method for additive model (penGAM) in Meier et al. (2009) can be employed
to select a subset of active variables. This results in NIS-penGAM. To
further enhance the performance of the method, in terms of false selection rates,
following Fan and Lv (2008) and Fan et al. (2009), we can iteratively employ
the large-scale screening and moderate-scale selection strategy, resulting in the
INIS-penGAM.

Given the data $\{(\mathbf{X}_i, Y_i)\}$, $i = 1, \ldots, n$, for each component $f_j(\cdot)$, $j =
1, \ldots, p$, we choose the same truncation term $d_n = O(n^{1/5})$. To determine
a data-driven thresholding for independence screening, we extend the random
permutation idea in Zhao and Li (2010), which allows only a $1 - q$ proportion
(for a given $q \in [0, 1]$) of inactive variables to enter the model when $\mathbf{X}$ and
Y are not related (the null model). The random permutation is used to
decouple $\mathbf{X}_i$ and $Y_i$ so that the resulting data $(\mathbf{X}_{\pi(i)}, Y_i)$ follow a null model,
where $\pi(1), \ldots, \pi(n)$ are a random permutation of the index $1, \ldots, n$. The
algorithm works as follows:

Step 1: For every $j \in \{1, \ldots, p\}$, we compute

$$\hat f_{nj} = \arg\min_{f_{nj} \in S_n} P_n (Y - f_{nj}(X_j))^2, \quad \text{for } 1 \le j \le p.$$

Randomly permute the rows of $\mathbf{X}$, yielding $\tilde{\mathbf{X}}$. Let $\omega_{(q)}$ be the q-th quantile
of $\{\|\hat f^*_{nj}\|_n^2, j = 1, 2, \ldots, p\}$, where

$$\hat f^*_{nj} = \arg\min_{f_{nj} \in S_n} P_n (Y - f_{nj}(\tilde X_j))^2.$$

Then, NIS selects the following variables:

$$\mathcal{A}_1 = \{j : \|\hat f_{nj}\|_n^2 \ge \omega_{(q)}\}.$$

In our numerical examples, we use q = 1 (i.e., take the maximum value
of the empirical norm of the permuted estimates).

Step 2: We further apply the penalized method for additive model (penGAM)
in Meier et al. (2009) on the set $\mathcal{A}_1$ to select a subset $\mathcal{M}_1$. Inside the
penGAM algorithm, the penalty parameter is selected by cross validation.

Step 3: For every $j \in \mathcal{M}_1^c = \{1, \ldots, p\} \setminus \mathcal{M}_1$, we minimize

$$P_n \Big( Y - \sum_{i \in \mathcal{M}_1} f_{ni}(X_i) - f_{nj}(X_j) \Big)^2$$

with respect to $f_{ni} \in S_n$ for all $i \in \mathcal{M}_1$ and $f_{nj} \in S_n$. This regression
reflects the additional contribution of the j-th component conditioning
on the existence of the variable set $\mathcal{M}_1$. After marginally screening as in
the first step, we can pick a set $\mathcal{A}_2$ of indices. Here the size determination
is the same as in Step 1, except that only the variables not in $\mathcal{M}_1$ are
randomly permuted. Then we further apply the penGAM algorithm on
the set $\mathcal{M}_1 \cup \mathcal{A}_2$ to select a subset $\mathcal{M}_2$.

Step 4: We iterate the process until $|\mathcal{M}_l| \ge s_0$ or $\mathcal{M}_l = \mathcal{M}_{l-1}$.
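The data-driven threshold of Step 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the marginal utility is passed in as a function, and in the demonstration we plug in the squared sample correlation as a stand-in for $\|\hat f_{nj}\|_n^2$; the order-statistic quantile rule and all function names are our own choices.

```python
import random

def squared_correlation(x, y):
    """Stand-in marginal utility: squared sample correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy) if sxx > 0 and syy > 0 else 0.0

def permutation_threshold(columns, y, utility, q=1.0, rng=random):
    """q-th quantile of the marginal utilities after permuting the rows of X."""
    perm = list(range(len(y)))
    rng.shuffle(perm)  # one permutation shared by all columns, i.e. whole rows move
    null_utils = sorted(utility([col[i] for i in perm], y) for col in columns)
    # Order-statistic quantile (an assumed convention); q = 1 gives the maximum.
    k = min(len(null_utils) - 1, max(0, round(q * len(null_utils)) - 1))
    return null_utils[k]

def screen(columns, y, utility, q=1.0, rng=random):
    """Indices whose utility on the original data reaches the null threshold."""
    omega = permutation_threshold(columns, y, utility, q, rng)
    return [j for j, col in enumerate(columns) if utility(col, y) >= omega]

# Toy data: only the first of 50 covariates is related to the response.
rng = random.Random(1)
n, p = 200, 50
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [columns[0][i] + 0.5 * rng.gauss(0, 1) for i in range(n)]
selected = screen(columns, y, squared_correlation, q=1.0, rng=rng)
```

With q = 1 the threshold is the largest utility observed under the permuted (null) data, so a variable is recruited only if its marginal signal exceeds everything seen by chance; smaller q trades false positives for a lower risk of missing weak signals.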
Here are a few comments about the method. In Step 2, we use the penGAM
method. In fact, any variable selection method for additive models will work,
such as the SpAM in Ravikumar et al. (2009) and also the adaptive group
LASSO for additive models in Huang et al. (2010). A similar sample splitting
idea as described in Fan et al. (2009) can be applied here to further reduce the
false selection rate.



4.2 Greedy INIS (g-INIS) 



We now propose a greedy modification to the INIS algorithm to speed up the 
computation and to enhance the performance. Specifically, we restrict the size 
of the set Aj in the iterative screening steps to be at most po, a small positive 
integer, and the algorithm stops when none of the variables is recruited, i.e., 
exceeding the thresholding for the null model. In the numerical studies, po is 
taken to be one for simplicity. This greedy version of the INIS algorithm is 
called "g-INIS". 

When $p_0 = 1$, the g-INIS method is connected with the forward selection
(Efroymson, 1960; Draper and Smith, 1966). Recently, Wang (2009) showed
that under certain technical conditions, forward selection can also achieve the
sure screening property. Both g-INIS and forward selection recruit at most 
one new variable into the model at a time. The major difference is that 
unlike the forward selection which keeps a variable once selected, g-INIS has 
a deletion step via penalized least-squares that can remove multiple variables. 
This makes the g-INIS algorithm more attractive since it is more flexible in 
terms of recruiting and deleting variables. 

The g-INIS is particularly effective when the covariates are highly cor- 
related or conditionally correlated. In this case, the original INIS method 
tends to select many unimportant variables that have high correlation with
important variables as they, too, have large marginal effects on the response.
Although greedy, the g-INIS method is better at choosing true positives due to
more stringent screening and improves the chance of the remaining important
variables being selected in subsequent stages due to fewer false positives at each
stage. This leads to conditioning on a smaller set of more relevant variables
and improves the overall performance. From our numerical experience, the
g-INIS method outperforms the original INIS method in all examples in terms
of higher true positive rate, smaller false positive rate and smaller prediction
error.



5 Numerical Results 



In this section, we will illustrate our method by studying the performance on
the simulated data and a real data analysis. Part of the simulation settings
are adapted from Fan and Lv (2008), Meier et al. (2009), Huang et al. (2010)
and Fan and Song (2010).



5.1 Comparison of Minimum Model Size 

We first illustrate the behavior of the NIS procedure under different correlation
structures. Following Fan and Song (2010), the minimum model size (MMS)
required for the NIS procedure and the penGAM procedure to have the sure
screening property, i.e., to contain the true model $\mathcal{M}_*$, is used as a measure
of the effectiveness of a screening method. We also include the correlation
screening of Fan and Lv (2008) for comparison. The advantage of the MMS
method is that we do not need to choose the thresholding parameter or
penalized parameters. For NIS, we take $d_n = \lfloor n^{1/5} \rfloor + 2 = 5$. We set n = 400 and
p = 1000 for all examples.
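For a ranking-based screener the MMS is simply the worst rank attained by a truly active variable. The short function below is a sketch of that computation; the name `minimum_model_size` and the toy utilities are ours and do not come from the simulation study.

```python
def minimum_model_size(utilities, true_set):
    """Smallest k such that the k top-ranked variables contain all of true_set."""
    order = sorted(range(len(utilities)), key=lambda j: utilities[j], reverse=True)
    rank = {j: r + 1 for r, j in enumerate(order)}  # rank 1 = largest utility
    return max(rank[j] for j in true_set)

# Hypothetical marginal utilities for five variables.
toy_utilities = [0.9, 0.1, 0.8, 0.3, 0.05]
mms_easy = minimum_model_size(toy_utilities, {0, 2, 3})
mms_hard = minimum_model_size(toy_utilities, {0, 4})
```

When the active variables occupy the top ranks the MMS equals the number of active variables, which is the best possible value; a single poorly ranked active variable drives the MMS up to its rank.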
Example 1. Following Fan and Song (2010), let $\{X_k\}_{k=1}^{950}$ be i.i.d. standard
normal random variables and

$$X_k = \sum_{j=1}^{s} X_j (-1)^{j+1} / 5 + \sqrt{1 - \frac{s}{25}}\, \varepsilon_k, \qquad k = 951, \ldots, 1000,$$

where $\{\varepsilon_k\}_{k=951}^{1000}$ are standard normally distributed. We consider the following
linear model as a specific case of the additive model: $Y = \beta^{*T} \mathbf{X} + \varepsilon$, in which
$\varepsilon \sim N(0, 3)$ and $\beta^* = (1, -1, \ldots)^T$ has s non-vanishing components, taking
values ±1 alternately.

Example 2. In this example, the data is generated from the simple linear
regression $Y = X_1 + X_2 + X_3 + \varepsilon$, where $\varepsilon \sim N(0, 3)$. However, the covariates are
not normally distributed: $\{X_k\}_{k \ne 2}$ are i.i.d. standard normal random variables
whereas $X_2 = -\frac{1}{3} X_1^3 + \tilde\varepsilon$, where $\tilde\varepsilon \sim N(0, 1)$. In this case, $E(Y \mid X_1)$ and
$E(Y \mid X_2)$ are nonlinear.
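To see why the marginal regression on $X_1$ is nonlinear even though the joint model is linear, one can compute the conditional mean directly. The calculation below assumes that $X_2 = -\frac{1}{3} X_1^3 + \tilde\varepsilon$ with $\tilde\varepsilon$ independent of $X_1$, and that $X_3$ and $\varepsilon$ are independent of $X_1$ with mean zero; this is our reading of the example.

```latex
\begin{align*}
E(Y \mid X_1)
  &= X_1 + E(X_2 \mid X_1) + E(X_3 \mid X_1) + E(\varepsilon \mid X_1) \\
  &= X_1 - \tfrac{1}{3} X_1^3 + 0 + 0 .
\end{align*}
```

The marginal projection is a cubic in $X_1$, and it is non-monotone, with turning points at $X_1 = \pm 1$. A ranking based on linear correlation therefore sees only part of the marginal signal, whereas a spline fit can capture the cubic shape.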



Table 1: Minimum model size and robust estimate of standard deviations (in
parentheses).

Model                       NIS         penGAM      SIS
Ex 1 (s = 3, SNR ≈ 1.01)    3(0)        3(0)        3(0)
Ex 1 (s = 6, SNR ≈ 1.99)    56(0)       1000(0)     56(0)
Ex 1 (s = 12, SNR ≈ 4.07)   66(7)       1000(0)     62(1)
Ex 1 (s = 24, SNR ≈ 8.20)   269(134)    1000(0)     109(43)
Ex 2 (SNR ≈ 0.83)           3(0)        3(0)        360(361)



The minimum model size (MMS) for each method and its associated robust
estimate of the standard deviation (RSD = IQR/1.34) are shown in
Table 1. The columns "NIS", "penGAM", and "SIS" summarize the results
on the MMS based on 100 simulations, respectively for the nonparametric
independence screening in the paper, the penalized method for additive model of
Meier et al. (2009), and the linear correlation ranking method of Fan and Lv
(2008). For Example 1, when the nonsparsity size s > 5, the irrepresentable
condition required for the model selection consistency of LASSO fails. For 
these cases, penGAM fails even to include the true model until the last step. 
In contrast, the proposed nonparametric independence screening performs rea- 
sonably well. It is also worth noting that SIS performs better than NIS in the 
first example, particularly for s = 24. This is due to the fact that the true 
model is linear and the covariates are jointly normally distributed, which im- 
plies that the marginal projection is also linear. In this case, NIS selects 
variables from pdn parameters whereas SIS selects only from p parameters. 
However, for a nonlinear problem like Example 2, both nonlinear methods,
NIS and penGAM, behave nicely, whereas SIS fails badly even though the
underlying true model is indeed linear. 
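The robust standard deviation reported in parentheses, RSD = IQR/1.34, can be computed as follows. This is a small sketch: the quartile convention (the default "exclusive" method of Python's `statistics.quantiles`) is our assumption, since the text does not say which convention was used.

```python
import statistics

def robust_sd(values):
    """Robust standard deviation estimate IQR / 1.34 (quartile rule assumed)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 1.34

# Hypothetical minimum model sizes from repeated simulations.
example_sizes = list(range(1, 100))
example_rsd = robust_sd(example_sizes)
```

The divisor 1.34 is approximately the interquartile range of a standard normal distribution, so the RSD estimates the usual standard deviation for normal data while being insensitive to a few extreme replications.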



5.2 Comparison of Model Selection and Estimation 

As in the previous section, we set n = 400 and p = 1000 for all the examples 
to demonstrate the power of our newly proposed methods INIS and g-INIS. 
Here in the NIS step, we fix $d_n = 5$ as in the last subsection. The number of
simulations is 100. Here, we use five-fold cross validation in Step 2 of the INIS
algorithm. For simplicity of notation, we let

$$g_1(x) = x, \qquad g_2(x) = (2x - 1)^2, \qquad g_3(x) = \frac{\sin(2\pi x)}{2 - \sin(2\pi x)}$$

and

$$g_4(x) = 0.1 \sin(2\pi x) + 0.2 \cos(2\pi x) + 0.3 \sin(2\pi x)^2 + 0.4 \cos(2\pi x)^3 + 0.5 \sin(2\pi x)^3.$$



Example 3. Following Meier et al. (2009), we generate the data from the
following additive model:

$$Y = 5 g_1(X_1) + 3 g_2(X_2) + 4 g_3(X_3) + 6 g_4(X_4) + \sqrt{1.74}\, \varepsilon.$$

The covariates $\mathbf{X} = (X_1, \ldots, X_p)^T$ are simulated according to the random
effect model

$$X_j = \frac{W_j + t U}{1 + t}, \qquad j = 1, \ldots, p,$$

where $W_1, \ldots, W_p$ and U are i.i.d. Unif(0, 1) and $\varepsilon \sim N(0, 1)$. When t = 0,
the covariates are all independent, and when t = 1 the pairwise correlation of
covariates is 0.5.
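The design of Example 3 can be simulated directly, which also confirms the stated pairwise correlation. The sketch below encodes the component functions and coefficients as we read them from the text (in particular the powers in $g_4$ and the noise scale $\sqrt{1.74}$ are our reading of the formulas); the function names are ours.

```python
import math
import random

def g1(x):
    return x

def g2(x):
    return (2 * x - 1) ** 2

def g3(x):
    s = math.sin(2 * math.pi * x)
    return s / (2 - s)

def g4(x):
    s, c = math.sin(2 * math.pi * x), math.cos(2 * math.pi * x)
    return 0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3

def simulate_example3(n, p, t, rng):
    """Draw n rows with X_j = (W_j + t*U) / (1 + t) and the additive response."""
    X, Y = [], []
    for _ in range(n):
        u = rng.random()  # shared random effect U for this observation
        row = [(rng.random() + t * u) / (1 + t) for _ in range(p)]
        signal = 5 * g1(row[0]) + 3 * g2(row[1]) + 4 * g3(row[2]) + 6 * g4(row[3])
        X.append(row)
        Y.append(signal + math.sqrt(1.74) * rng.gauss(0, 1))
    return X, Y

def sample_corr(a, b):
    """Plain sample correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(2024)
X1, _ = simulate_example3(4000, 6, 1.0, rng)
X0, _ = simulate_example3(4000, 6, 0.0, rng)
corr_t1 = sample_corr([r[0] for r in X1], [r[1] for r in X1])
corr_t0 = sample_corr([r[0] for r in X0], [r[1] for r in X0])
```

The population correlation between two covariates is $t^2 / (1 + t^2)$, since they share only the term $tU$; this gives 0 at t = 0 and 0.5 at t = 1, in agreement with the text.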

Example 4. Again, we adapt the simulation model from Meier et al.
(2009). This example is a more difficult case than Example 3 since it has 12
important variables with different coefficients.

$$\begin{aligned}
Y ={}& g_1(X_1) + g_2(X_2) + g_3(X_3) + g_4(X_4) \\
&+ 1.5 g_1(X_5) + 1.5 g_2(X_6) + 1.5 g_3(X_7) + 1.5 g_4(X_8) \\
&+ 2 g_1(X_9) + 2 g_2(X_{10}) + 2 g_3(X_{11}) + 2 g_4(X_{12}) + \sqrt{0.5184}\, \varepsilon,
\end{aligned}$$

where $\varepsilon \sim N(0, 1)$. The covariates are simulated as in Example 3.



Example 5. We follow the simulation model of Fan et al. (2009), in
which $Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$ is simulated, where $\varepsilon \sim N(0, 1)$.
The covariates $X_1, \ldots, X_p$ are jointly Gaussian, marginally N(0, 1), and with
$\mathrm{corr}(X_i, X_4) = 1/\sqrt{2}$ for all $i \ne 4$ and $\mathrm{corr}(X_i, X_j) = 1/2$ if i and j are
distinct elements of $\{1, \ldots, p\} \setminus \{4\}$. The coefficients $\beta_1 = 2$, $\beta_2 = 2$, $\beta_3 = 2$,
$\beta_4 = -3\sqrt{2}$, and $\beta_j = 0$ for j > 4 are taken so that $X_4$ is independent of Y, even
though it is the most important variable in the joint model, in terms of the
regression coefficient.
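The marginal independence of $X_4$ and $Y$ can be verified with a one-line covariance calculation, using only the stated correlations, the unit marginal variances, and the independence of the noise $\varepsilon$ from the covariates:

```latex
\begin{align*}
\operatorname{cov}(X_4, Y)
  &= \sum_{i=1}^{3} \beta_i \operatorname{corr}(X_i, X_4) + \beta_4 \operatorname{var}(X_4) \\
  &= 3 \cdot 2 \cdot \frac{1}{\sqrt{2}} - 3\sqrt{2}
   = 3\sqrt{2} - 3\sqrt{2} = 0 .
\end{align*}
```

Since $(X_4, Y)$ is jointly Gaussian, zero covariance implies independence. Any purely marginal screener therefore ranks $X_4$ no higher than a noise variable, which is why an iterative, conditional step is needed to recover it.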

For each example, we compare the performances of INIS-penGAM and
g-INIS-penGAM proposed in the paper, penGAM (Meier et al., 2009), and ISIS-SCAD
(Fan et al., 2009), which aims for sparse linear models. Their results are shown
respectively in the rows "INIS", "g-INIS", "penGAM" and "ISIS" of Table 2, in
which the True Positives(TP), False Positives(FP), Prediction Error (PE) and 
Computation Time (Time) are reported for each method. Here the prediction 
error is calculated on an independent test data set of size n/2. 

First of all, for the greedy modification, g-INIS-penGAM, the number of
false positive variables is approximately 1 for all examples, and the numbers of
false positives for both INIS-penGAM and ISIS-SCAD are much smaller than
that for penGAM. In terms of true positives, we can see that in Examples 3 and
4, INIS-penGAM and penGAM have similar performance, whereas penGAM
misses one variable most of the time in Example 5. The linear method
ISIS-SCAD misses important variables in the nonlinear models in Examples 3 and
4.

Table 2: Average values of the numbers of true (TP) and false (FP) positives,
prediction error (PE), and Time (in seconds). Robust standard deviations are
given in parentheses.

Model                        Method   TP            FP              PE            Time
Ex 3 (t = 0), SNR ≈ 9.02     INIS     4.00(0.00)    2.58(2.24)      3.02(0.34)    18.50(7.22)
                             g-INIS   4.00(0.00)    0.67(0.75)      2.92(0.30)    25.03(4.87)
                             penGAM   4.00(0.00)    31.86(23.51)    3.30(0.40)    180.63(6.92)
                             ISIS     3.03(0.00)    29.97(0.00)     15.95(1.74)   12.95(4.18)
Ex 3 (t = 1), SNR ≈ 7.58     INIS     3.98(0.00)    15.76(6.72)     2.97(0.39)    78.80(26.91)
                             g-INIS   4.00(0.00)    0.98(1.49)      2.61(0.26)    33.89(9.99)
                             penGAM   4.00(0.00)    39.21(24.63)    2.97(0.28)    254.06(13.06)
                             ISIS     3.01(0.00)    29.99(0.00)     12.91(1.39)   18.59(4.37)
Ex 4 (t = 0), SNR ≈ 8.67     INIS     11.97(0.00)   3.22(1.49)      0.97(0.11)    73.60(25.77)
                             g-INIS   12.00(0.00)   0.73(0.75)      0.91(0.10)    160.75(19.94)
                             penGAM   11.99(0.00)   80.10(18.28)    1.27(0.14)    233.72(10.25)
                             ISIS     7.96(0.75)    25.04(0.75)     4.70(0.40)    12.89(5.00)
Ex 4 (t = 1), SNR ≈ 10.89    INIS     10.01(1.49)   15.56(0.93)     1.03(0.13)    125.11(39.99)
                             g-INIS   10.78(0.75)   1.08(1.49)      0.87(0.11)    156.37(28.58)
                             penGAM   10.51(0.75)   62.11(26.31)    1.13(0.12)    278.61(16.93)
                             ISIS     6.53(0.75)    26.47(0.75)     4.30(0.44)    17.02(4.01)
Ex 5, SNR ≈ 6.11             INIS     3.99(0.00)    21.96(0.00)     1.62(0.18)    94.50(7.12)
                             g-INIS   4.00(0.00)    1.04(1.49)      1.16(0.12)    39.78(12.45)
                             penGAM   3.00(0.00)    195.03(21.08)   1.93(0.28)    1481.12(181.93)
                             ISIS     4.00(0.00)    29.00(0.00)     1.40(0.17)    17.78(3.85)

One may notice that in Example 4 (t = 1), even INIS and g-INIS miss
more than one variable on average. To explore the reason, we took a close
look at the iterative process for this example and found that the variables $X_1$
and $X_2$ are missed quite often. The explanation is that although the overall
SNR (Signal to Noise Ratio) for this example is around 10.89, the individual
contributions to the total signal vary significantly. Now, let us introduce the
notion of individual SNR. For example, $\mathrm{var}(m_1(X_1))/\mathrm{var}(\varepsilon)$ in the additive
model

$$Y = m_1(X_1) + \cdots + m_p(X_p) + \varepsilon$$

is the individual SNR for the first component under the oracle model where
$m_2, \ldots, m_p$ are known. In Example 4 (t = 1), the variances of all 12 components
are as follows:

Component   1     2     3     4     5     6     7     8     9     10    11    12
Variance    0.08  0.09  0.21  0.26  0.19  0.20  0.47  0.58  0.33  0.36  0.84  1.03

We can see that the variance varies a lot among the 12 components, which
leads to very different individual SNRs. For example, the individual SNR for
the first component is merely 0.08/0.518 = 0.154, which makes this component
very hard to detect. With the overall SNR fixed, the individual SNRs play an
important role in measuring the difficulty of selecting individual variables.
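The individual SNRs can be approximated by Monte Carlo. The sketch below is an illustration under stated assumptions rather than the computation used for the values reported above: it simulates the covariates of Example 4 for a chosen $t$ and returns $\mathrm{var}(c_j g_j(X_j))/\mathrm{var}(\sqrt{0.5184}\,\varepsilon)$ for the 12 components. The helper name `individual_snr`, the Monte Carlo size and the seed are hypothetical, and the resulting values depend on $t$.

```python
import numpy as np

def g1(x):
    return x

def g2(x):
    return (2 * x - 1) ** 2

def g3(x):
    s = np.sin(2 * np.pi * x)
    return s / (2 - s)

def g4(x):
    s, c = np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)
    return 0.1 * s + 0.2 * c + 0.3 * s**2 + 0.4 * c**3 + 0.5 * s**3

COMPONENTS = [g1, g2, g3, g4]   # cycled over X_1..X_4, X_5..X_8, X_9..X_12
COEFS = [1.0, 1.5, 2.0]         # one coefficient per block of four variables

def individual_snr(t, n=200_000, noise_var=0.5184, seed=0):
    """Monte Carlo estimate of var(c_j g_j(X_j)) / noise_var in Example 4."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(n, 12))
    U = rng.uniform(size=(n, 1))
    X = (W + t * U) / (1 + t)
    snr = [np.var(COEFS[j // 4] * COMPONENTS[j % 4](X[:, j])) / noise_var
           for j in range(12)]
    return np.array(snr)

snr_t0 = individual_snr(t=0.0)
```

Components that share the same function $g_k$ differ only through the squared coefficient, so the second block of four has individual SNRs about $1.5^2 = 2.25$ times those of the first block.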

In terms of the prediction error, INIS-penGAM, g-INIS-penGAM
and penGAM outperform ISIS-SCAD in the nonlinear models, whereas in the
linear model, Example 5, INIS-penGAM and penGAM perform worse than
ISIS-SCAD. Overall, it is quite clear that the greedy modification g-INIS is a
competitive variable selection method in ultra-high dimensional additive models,
with a very low false selection rate, small prediction errors, and fast computation.

5.3 $d_n$ and SNR

In this subsection, we conduct a simulation study to investigate the performance
of the INIS-penGAM estimator under different SNR settings using different
numbers ($d_n$) of basis functions.

Example 6. We generate the data from the following additive model:

$$Y = 3g_1(X_1) + 3g_2(X_2) + 2g_3(X_3) + 2g_4(X_4) + C\sqrt{3.3843}\,\varepsilon,$$

where the covariates $X = (X_1, \ldots, X_p)^T$ are simulated according to Example
3. Here $C$ takes a series of different values ($C^2 = 2, 1, 0.5, 0.25$) to make the
corresponding SNR = 0.5, 1, 2, 4. We report the results of using number of
basis functions $d_n = 2, 4, 8, 16$ in Tables 4 and 5 in the Appendix.

From Table 4 in the Appendix, where all the variables are independent,
both methods have very good true positives under various SNR when $d_n$ is
not too large. However, for the case of SNR = 0.5 and $d_n = 16$, INIS and
penGAM perform poorly in terms of true positive rate. This is due to
the fact that when $d_n$ is large, the estimation variance will be large and this
makes it difficult to differentiate the active variables from inactive ones when
the signals are weak.

Now let us have a look at the more difficult case in Table 5 (in the
Appendix), where the pairwise correlation between variables is 0.5. We can see that
INIS has a competitive performance under various SNR values except when
$d_n = 16$. When SNR = 0.5, we cannot achieve sure screening under the
current sample size and configuration for the aforementioned reasons.



5.4 An analysis on Affymetrix GeneChip Rat Genome
230 2.0 Array

We use the data set reported in Scheetz et al. (2006) and analyzed by
Huang et al. (2010) to illustrate the application of the proposed method. For
this data set, 120 twelve-week-old male rats were selected for tissue harvesting
from the eyes and for microarray analysis. The microarrays used to analyze the
RNA from the eyes of these animals contain over 31,042 different probe sets
(Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were
normalized using the robust multi-chip averaging method (Irizarry et al., 2003) to
obtain summary expression values for each probe set. Gene expression levels
were analyzed on a logarithmic scale.



Following Huang et al. (2010), we are interested in finding the genes that
are related to the gene TRIM32, which was recently found to cause Bardet-Biedl
syndrome (Chiang et al., 2006), a genetically heterogeneous disease of
multiple organ systems including the retina. Although over 30,000
probe sets are represented on the Rat Genome 230 2.0 Array, many of them
are not expressed in the eye tissue. We only focus on the 18975 probes
which are expressed in the eye tissue. We use our INIS-penGAM method
directly on this dataset, where n = 120 and p = 18975, and the method is
denoted as INIS-penGAM (p = 18975). Direct application of the penGAM
approach on the whole dataset is too slow. Following Huang et al. (2010), we
use 2000 probe sets that are expressed in the eye and have highest marginal
correlation with TRIM32 in the analysis. On the subset of the data (n =
120, p = 2000), we apply the INIS-penGAM and penGAM to model the
relation between the expression of TRIM32 and those of the 2000 genes.
For simplicity, we did not implement g-INIS-penGAM. Prior to the analysis,
we standardize each probe to be of mean 0 and variance 1. Now, we
have three different estimators, INIS-penGAM (p = 18975), INIS-penGAM
(p = 2000) and penGAM (p = 2000). The INIS-penGAM (p = 18975)
selects the following 8 probes: 1371755_at, 1372928_at, 1373534_at, 1373944_at,
1374669_at, 1376686_at, 1376747_at, 1377880_at. The INIS-penGAM (p =
2000) selects the following 8 probes: 1376686_at, 1376747_at, 1378590_at,
1373534_at, 1377880_at, 1372928_at, 1374669_at, 1373944_at. On the other
hand, the penGAM (p = 2000) selects 32 probes. The residual sums of squares
(RSS) for these fits are 0.24, 0.26 and 0.1 for INIS-penGAM (p = 18975),
INIS-penGAM (p = 2000) and penGAM (p = 2000), respectively.

Figure 1: Fitted regression functions for the 8 probes that are selected by
INIS-penGAM (p = 18975).

In order to further evaluate the performances of the two methods, we use
cross-validation and compare the prediction mean square error (PE). We
randomly partition the data into a training set of 100 observations and a test
set of 20 observations. We compute the number of probes selected using the
100 training observations and the prediction error on the 20 test observations.
This process is repeated 100 times. Table 3 gives the average values and their
associated robust standard deviations over 100 replications. It is clear in the
table that by applying the INIS-penGAM approach, we select far fewer genes and
obtain comparable or slightly smaller prediction error. Therefore, in this example,
the INIS-penGAM provides the biological investigator a more targeted list of
probe sets, which could be very useful in further study.

Table 3: Mean Model Size (MS) and Prediction Error (PE) over 100 repetitions
and their robust standard deviations (in parentheses) for INIS (p = 18975),
INIS (p = 2000) and penGAM (p = 2000).

Method               MS             PE
INIS (p = 18975)     7.73(0.00)     0.47(0.13)
INIS (p = 2000)      7.68(0.75)     0.44(0.15)
penGAM (p = 2000)    26.71(14.93)   0.48(0.16)
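The repeated random-split evaluation can be sketched as follows. This is an illustration only: the fitted model is a plain least-squares stand-in rather than INIS-penGAM or penGAM, the toy data are synthetic, and the robust standard deviation is taken to be IQR/1.34, which is an assumption about the convention. The names `robust_sd`, `repeated_split_pe` and `ols_fit_predict` are hypothetical.

```python
import numpy as np

def robust_sd(values):
    """Robust standard deviation, assumed here to be IQR / 1.34."""
    q75, q25 = np.percentile(values, [75, 25])
    return (q75 - q25) / 1.34

def repeated_split_pe(X, y, fit_predict, n_train=100, n_rep=100, seed=0):
    """Mean and robust SD of the test prediction error over random splits."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_rep):
        perm = rng.permutation(len(y))
        train, test = perm[:n_train], perm[n_train:]
        pred = fit_predict(X[train], y[train], X[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    errors = np.asarray(errors)
    return errors.mean(), robust_sd(errors)

def ols_fit_predict(X_tr, y_tr, X_te):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef

# Synthetic data with n = 120, split 100 / 20 as in the text.
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 3))
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.standard_normal(120)
mean_pe, sd_pe = repeated_split_pe(X, y, ols_fit_predict)
```

Any selection-plus-estimation procedure can be plugged in through the `fit_predict` argument, provided it is refitted on each training split so that the test observations are never used for selection.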

6 Discussion 

In this paper, we study the nonparametric independence screening (NIS) method
for variable selection in additive models. B-spline basis functions are used for
fitting the marginal nonparametric components. The proposed marginal
projection criterion is an important extension of the marginal correlation. Iterative
NIS procedures are also proposed such that variable selection and coefficient
estimation can be achieved simultaneously. By applying the INIS-penGAM
method, we can preserve the sure screening property and substantially reduce
the false selection rate. A greedy modification of the method, g-INIS-penGAM,
is proposed to further reduce the false selection rate. Moreover, we can deal
with the case where some variable is marginally uncorrelated but jointly
correlated with the response. The proposed method can be easily extended to
generalized additive models under appropriate conditions.
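To make the marginal screening step concrete, the following Python sketch ranks covariates by the empirical norm of their marginal nonparametric fit. It is a simplified illustration under stated assumptions: a Legendre polynomial basis stands in for the B-spline basis, the top `n_keep` variables are retained instead of applying a data-driven threshold, and the names `marginal_utility` and `nis_screen` as well as the toy model are hypothetical.

```python
import numpy as np

def marginal_utility(x, y, d_n=4):
    """Mean squared fitted value of the marginal regression of y on a basis in x."""
    z = 2 * (x - x.min()) / (x.max() - x.min()) - 1            # rescale to [-1, 1]
    Psi = np.polynomial.legendre.legvander(z, d_n)[:, 1:]     # drop the constant
    Psi = Psi - Psi.mean(axis=0)
    yc = y - y.mean()
    coef, *_ = np.linalg.lstsq(Psi, yc, rcond=None)           # marginal least squares
    return np.mean((Psi @ coef) ** 2)

def nis_screen(X, y, n_keep, d_n=4):
    """Return the indices of the n_keep largest marginal utilities."""
    util = np.array([marginal_utility(X[:, j], y, d_n)
                     for j in range(X.shape[1])])
    return np.argsort(util)[::-1][:n_keep], util

# Toy additive model with two active variables among p = 200.
rng = np.random.default_rng(0)
n, p = 400, 200
X = rng.uniform(size=(n, p))
y = 5 * X[:, 0] + 3 * (2 * X[:, 1] - 1) ** 2 + rng.standard_normal(n)
selected, util = nis_screen(X, y, n_keep=10)
```

The second active variable enters through a symmetric quadratic function, so its marginal linear correlation with the response is close to zero while its nonparametric marginal utility is not; this is the kind of signal that correlation-based screening would miss.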

Although the additive components are specifically approximated by truncated
series expansions with B-spline bases in this paper, the theoretical results
should hold in general and the proposed framework can be readily adapted
to other smoothing methods with additive models (Horowitz et al., 2006;
Silverman, 1984), such as local polynomial regression (Fan and Jiang, 2005),
wavelets approximations (Antoniadis and Fan, 2001; Sardy and Tseng, 2004)
and smoothing spline (Speckman, 1985). This is an interesting topic for
future research.



7 Proofs 

Proof of Lemma 1.

By the property of the least-squares, $E(Y - f_{nj})f_{nj} = 0$ and $E(Y - f_j)f_{nj} = 0$.
Therefore,

$$E f_{nj}(f_j - f_{nj}) = E(Y - f_{nj})f_{nj} - E(Y - f_j)f_{nj} = 0.$$

It follows from this and the orthogonal decomposition $f_j = f_{nj} + (f_j - f_{nj})$
that

$$\|f_{nj}\|^2 = \|f_j\|^2 - \|f_j - f_{nj}\|^2.$$

The desired result follows from Condition C together with Fact 1. □

The following two types of Bernstein's inequality in van der Vaart and Wellner
(1996) will be needed. We reproduce them here for the sake of readability.



Lemma 2 (Bernstein's inequality, Lemma 2.2.9, van der Vaart and Wellner
(1996)). For independent random variables $Y_1, \ldots, Y_n$ with bounded ranges
$[-M, M]$ and zero means,

$$P(|Y_1 + \cdots + Y_n| > x) \le 2\exp\{-x^2/(2(v + Mx/3))\},$$

for $v \ge \mathrm{var}(Y_1 + \cdots + Y_n)$.
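As a numerical illustration of this bound (a sketch, not part of the proof), the Python snippet below compares the Bernstein bound with a simulated tail probability for sums of independent uniform variables on $[-M, M]$. The choice of distribution, the values of `n`, `M`, `reps` and the thresholds `xs` are illustrative assumptions.

```python
import numpy as np

def bernstein_bound(x, v, M):
    """Right-hand side of the bound: 2 exp{-x^2 / (2 (v + M x / 3))}."""
    return 2.0 * np.exp(-x**2 / (2.0 * (v + M * x / 3.0)))

rng = np.random.default_rng(0)
n, M, reps = 200, 1.0, 20000

# Independent zero-mean variables bounded by M; each has variance M^2 / 3.
Y = rng.uniform(-M, M, size=(reps, n))
S = Y.sum(axis=1)
v = n * M**2 / 3.0                     # equals var(Y_1 + ... + Y_n)

xs = np.array([10.0, 20.0, 30.0])
empirical = np.array([(np.abs(S) > x).mean() for x in xs])
bounds = bernstein_bound(xs, v, M)
```

For moderate $x$ the bound is loose but valid, and it decays at a sub-Gaussian rate as long as $Mx/3$ is small relative to $v$, which is the regime exploited in the proofs below.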



Lemma 3 (Bernstein's inequality, Lemma 2.2.11, van der Vaart and Wellner
(1996)). Let $Y_1, \ldots, Y_n$ be independent random variables with zero mean such
that $E|Y_i|^m \le m!\,M^{m-2}v_i/2$, for every $m \ge 2$ (and all $i$) and some constants
$M$ and $v_i$. Then

$$P(|Y_1 + \cdots + Y_n| > x) \le 2\exp\{-x^2/(2(v + Mx))\},$$

for $v \ge v_1 + \cdots + v_n$.

The following two lemmas will be needed to prove Theorem 1.

Lemma 4. Under Conditions A, B and D, for any $\delta > 0$, there exist some
positive constants $c_6$ and $c_7$ such that

$$P(|(P_n - E)\Psi_{jk}Y| \ge \delta n^{-1}) \le 4\exp\Big\{-\frac{\delta^2}{2(c_6nd_n^{-1} + c_7\delta)}\Big\},$$

for $k = 1, \ldots, d_n$, $j = 1, \ldots, p$.

Proof of Lemma 4.

Denote by $T_{jki} = \Psi_{jk}(X_{ij})Y_i - E\Psi_{jk}(X_{ij})Y_i$. Since $Y_i = m(X_i) + \varepsilon_i$, we
can write $T_{jki} = T_{jki1} + T_{jki2}$, where

$$T_{jki1} = \Psi_{jk}(X_{ij})m(X_i) - E\Psi_{jk}(X_{ij})m(X_i),$$

and $T_{jki2} = \Psi_{jk}(X_{ij})\varepsilon_i$.

By Conditions A, B, D and Fact 2, recalling $\|\Psi_{jk}\|_\infty \le 1$, we have

$$|T_{jki1}| \le 2B_1, \quad \mathrm{var}(T_{jki1}) \le E\Psi_{jk}^2(X_{ij})m^2(X_i) \le B_1^2C_2d_n^{-1}. \tag{12}$$

By Bernstein's inequality (Lemma 2), for any $\delta_1 > 0$,

$$P\Big(\Big|\sum_{i=1}^n T_{jki1}\Big| > \delta_1\Big) \le 2\exp\Big\{-\frac{1}{2}\frac{\delta_1^2}{nB_1^2C_2d_n^{-1} + 2B_1\delta_1/3}\Big\}. \tag{13}$$

Next, we bound the tails of $T_{jki2}$. For every $r \ge 2$,

$$\begin{aligned}
E|T_{jki2}|^r &\le E|\Psi_{jk}(X_{ij})|^r E(|\varepsilon_i|^r \mid X_i) \\
&\le r!\,B_2^{-r}E|\Psi_{jk}(X_{ij})|^r E\exp(B_2|\varepsilon_i| \mid X_i) \\
&\le B_3C_2d_n^{-1}r!\,B_2^{-r},
\end{aligned}$$

where the last inequality utilizes Condition E and Fact 2. By Bernstein's
inequality (Lemma 3), for any $\delta_2 > 0$,

$$P\Big(\Big|\sum_{i=1}^n T_{jki2}\Big| > \delta_2\Big) \le 2\exp\Big\{-\frac{1}{2}\frac{\delta_2^2}{2nB_2^{-2}B_3C_2d_n^{-1} + B_2^{-1}\delta_2}\Big\}. \tag{14}$$

Combining (13) and (14), the desired result follows by taking
$c_6 = \max(B_1^2C_2, 2B_2^{-2}B_3C_2)$ and $c_7 = \max(2B_1/3, B_2^{-1})$. □

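Throughout the rest of the proof, for any matrix $A$, let $\|A\| = \sqrt{\lambda_{\max}(A^TA)}$
be the operator norm and $\|A\|_\infty = \max_{i,j}|A_{ij}|$ be the infinity norm. The next
lemma is about the tail probability of the eigenvalues of the design matrix.

Lemma 5. Under Conditions A and B, for any $\delta > 0$,

$$P\big(|\lambda_{\min}(P_n\Psi_j\Psi_j^T) - \lambda_{\min}(E\Psi_j\Psi_j^T)| \ge d_n\delta/n\big) \le 2d_n^2\exp\Big\{-\frac{1}{2}\frac{\delta^2}{C_2nd_n^{-1} + \delta/3}\Big\}.$$

In addition, for any given constant $c_4$, there exists some positive constant $c_8$
such that

$$P\Big\{\Big|\|(P_n\Psi_j\Psi_j^T)^{-1}\| - \|(E\Psi_j\Psi_j^T)^{-1}\|\Big| \ge c_8\|(E\Psi_j\Psi_j^T)^{-1}\|\Big\} \le 2d_n^2\exp(-c_4nd_n^{-3}). \tag{15}$$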
Proof of Lemma 5.

For any symmetric matrices $A$ and $B$ and any $\|x\| = 1$, where $\|\cdot\|$ is the
Euclidean norm,

$$x^T(A + B)x = x^TAx + x^TBx \ge \min_{\|x\|=1}x^TAx + \min_{\|x\|=1}x^TBx.$$

Taking minimum among $\|x\| = 1$ on the left side, we have

$$\min_{\|x\|=1}x^T(A + B)x \ge \min_{\|x\|=1}x^TAx + \min_{\|x\|=1}x^TBx,$$

which is equivalent to $\lambda_{\min}(A + B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$.
Then we have

$$\lambda_{\min}(A) \ge \lambda_{\min}(B) + \lambda_{\min}(A - B),$$

which is the same as

$$\lambda_{\min}(A - B) \le \lambda_{\min}(A) - \lambda_{\min}(B).$$

By switching the roles of $A$ and $B$, we also have

$$\lambda_{\min}(B - A) \le \lambda_{\min}(B) - \lambda_{\min}(A).$$

In other words,

$$|\lambda_{\min}(A) - \lambda_{\min}(B)| \le \max\{|\lambda_{\min}(A - B)|, |\lambda_{\min}(B - A)|\}. \tag{16}$$
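Both eigenvalue inequalities used so far can be checked numerically on random symmetric matrices. The following Python sketch is only a sanity check, with the matrix size, the number of trials and the tolerance chosen for illustration.

```python
import numpy as np

def lam_min(S):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(S)[0]

rng = np.random.default_rng(0)
ok_sum, ok_diff = True, True
for _ in range(1000):
    G, H = rng.standard_normal((2, 5, 5))
    A, B = (G + G.T) / 2, (H + H.T) / 2        # random symmetric matrices

    # lambda_min(A + B) >= lambda_min(A) + lambda_min(B)
    ok_sum = ok_sum and bool(lam_min(A + B) >= lam_min(A) + lam_min(B) - 1e-10)

    # Inequality (16)
    lhs = abs(lam_min(A) - lam_min(B))
    rhs = max(abs(lam_min(A - B)), abs(lam_min(B - A)))
    ok_diff = ok_diff and bool(lhs <= rhs + 1e-10)
```

Since $\lambda_{\min}(B - A) = -\lambda_{\max}(A - B)$, inequality (16) says that the smallest eigenvalue moves by at most the operator norm of the perturbation $A - B$.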

Let $D_j = P_n\Psi_j\Psi_j^T - E\Psi_j\Psi_j^T$. Then, it follows from (16) that

$$|\lambda_{\min}(P_n\Psi_j\Psi_j^T) - \lambda_{\min}(E\Psi_j\Psi_j^T)| \le \max\{|\lambda_{\min}(D_j)|, |\lambda_{\min}(-D_j)|\}. \tag{17}$$

We now bound the right-hand side of (17). Let $D_j^{(i,l)}$ be the $(i, l)$ entry of $D_j$.
Then, it is easy to see that for any $\|x\| = 1$,

$$|x^TD_jx| \le \|D_j\|_\infty\Big(\sum_{i=1}^{d_n}|x_i|\Big)^2 \le d_n\|D_j\|_\infty. \tag{18}$$

Thus,

$$\lambda_{\min}(D_j) = \min_{\|x\|=1}x^TD_jx \le d_n\|D_j\|_\infty.$$

On the other hand, by using (18) again, we have

$$\lambda_{\min}(D_j) = -\max_{\|x\|=1}(-x^TD_jx) \ge -d_n\|D_j\|_\infty.$$

We conclude that

$$|\lambda_{\min}(D_j)| \le d_n\|D_j\|_\infty.$$

The same bound on $|\lambda_{\min}(-D_j)|$ can be obtained by using the same argument.
Thus, by (17), we have

$$|\lambda_{\min}(P_n\Psi_j\Psi_j^T) - \lambda_{\min}(E\Psi_j\Psi_j^T)| \le d_n\|D_j\|_\infty. \tag{19}$$

We now use Bernstein's inequality to bound the right-hand side of (19).
Since $\|\Psi_{jk}\|_\infty \le 1$, and by using Fact 2, we have that

$$\mathrm{var}(\Psi_{jk}(X_j)\Psi_{jl}(X_j)) \le E\Psi_{jk}^2(X_j)\Psi_{jl}^2(X_j) \le E\Psi_{jk}^2(X_j) \le C_2d_n^{-1}.$$

By Bernstein's inequality (Lemma 2), for any $\delta > 0$,

$$P(|(P_n - E)\Psi_{jk}(X_j)\Psi_{jl}(X_j)| > \delta/n) \le 2\exp\Big\{-\frac{1}{2}\frac{\delta^2}{C_2nd_n^{-1} + \delta/3}\Big\}. \tag{20}$$

It follows from (19), (20) and the union bound of probability that

$$P\big(|\lambda_{\min}(P_n\Psi_j\Psi_j^T) - \lambda_{\min}(E\Psi_j\Psi_j^T)| \ge d_n\delta/n\big) \le 2d_n^2\exp\Big\{-\frac{1}{2}\frac{\delta^2}{C_2nd_n^{-1} + \delta/3}\Big\}.$$

This completes the proof of the first inequality.

To prove the second inequality, let us take $\delta = c_9D_1nd_n^{-2}$ in (20), where
$c_9 \in (0, 1)$. By recalling Fact 3, it follows that

$$P\big(|\lambda_{\min}(P_n\Psi_j\Psi_j^T) - \lambda_{\min}(E\Psi_j\Psi_j^T)| \ge c_9\lambda_{\min}(E\Psi_j\Psi_j^T)\big) \le 2d_n^2\exp(-c_4nd_n^{-3}), \tag{21}$$

for some positive constant $c_4$. The second part of the lemma thus follows from
the fact that $\lambda_{\min}(H)^{-1} = \lambda_{\max}(H^{-1})$, if we establish

$$P\Big(\big|\{\lambda_{\min}(P_n\Psi_j\Psi_j^T)\}^{-1} - \{\lambda_{\min}(E\Psi_j\Psi_j^T)\}^{-1}\big| \ge c_8\{\lambda_{\min}(E\Psi_j\Psi_j^T)\}^{-1}\Big) \le 2d_n^2\exp(-c_4nd_n^{-3}), \tag{22}$$

by using (21), where $c_8 = 1/(1 - c_9) - 1$.

We now deduce (22) from (21). Let $A = \lambda_{\min}(P_n\Psi_j\Psi_j^T)$ and $B = \lambda_{\min}(E\Psi_j\Psi_j^T)$.
Then, $A > 0$ and $B > 0$. We aim to show for $a \in (0, 1)$,

$$|A^{-1} - B^{-1}| \ge cB^{-1} \text{ implies } |A - B| \ge aB,$$

where $c = 1/(1 - a) - 1$.
Since

$$|A^{-1} - B^{-1}| \ge (1/(1 - a) - 1)B^{-1},$$

we have

$$A^{-1} - B^{-1} \le -(1/(1 - a) - 1)B^{-1}, \text{ or } A^{-1} - B^{-1} \ge (1/(1 - a) - 1)B^{-1}.$$

Note that for $a \in (0, 1)$, we have $1 - 1/(1 + a) \le 1/(1 - a) - 1$. Then it follows
that

$$A^{-1} - B^{-1} \le -(1 - 1/(1 + a))B^{-1}, \text{ or } A^{-1} - B^{-1} \ge (1/(1 - a) - 1)B^{-1},$$

which is equivalent to $|A - B| \ge aB$.

This concludes the proof of the lemma. □

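Proof of Theorem 1.

We first show part (i). Recall that

$$\|\hat f_{nj}\|_n^2 = (P_n\Psi_jY)^T(P_n\Psi_j\Psi_j^T)^{-1}P_n\Psi_jY,$$

and

$$\|f_{nj}\|^2 = (E\Psi_jY)^T(E\Psi_j\Psi_j^T)^{-1}E\Psi_jY.$$

Let $a_n = P_n\Psi_jY$, $B_n = (P_n\Psi_j\Psi_j^T)^{-1}$, $a = E\Psi_jY$ and $B = (E\Psi_j\Psi_j^T)^{-1}$.
By the algebraic identity

$$a_n^TB_na_n - a^TBa = (a_n - a)^TB_n(a_n - a) + 2(a_n - a)^TB_na + a^T(B_n - B)a,$$

we have

$$\|\hat f_{nj}\|_n^2 - \|f_{nj}\|^2 = S_1 + S_2 + S_3, \tag{23}$$

where

$$\begin{aligned}
S_1 &= (P_n\Psi_jY - E\Psi_jY)^T(P_n\Psi_j\Psi_j^T)^{-1}(P_n\Psi_jY - E\Psi_jY), \\
S_2 &= 2(P_n\Psi_jY - E\Psi_jY)^T(P_n\Psi_j\Psi_j^T)^{-1}E\Psi_jY, \\
S_3 &= (E\Psi_jY)^T\{(P_n\Psi_j\Psi_j^T)^{-1} - (E\Psi_j\Psi_j^T)^{-1}\}E\Psi_jY.
\end{aligned}$$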
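The three-term decomposition of the difference of quadratic forms holds for any vectors $a_n$, $a$ and symmetric matrices $B_n$, $B$. The Python sketch below verifies it on random inputs; the dimension, the seed and the helper `random_spd` are illustrative assumptions, and the variable names mirror the notation $a_n$, $a$, $B_n$, $B$ of the proof.

```python
import numpy as np

def random_spd(rng, d):
    """Random symmetric positive definite matrix (hypothetical helper)."""
    G = rng.standard_normal((d, d))
    return G @ G.T + d * np.eye(d)

rng = np.random.default_rng(0)
d = 4
a_n = rng.standard_normal(d)
a = rng.standard_normal(d)
B_n = random_spd(rng, d)
B = random_spd(rng, d)

S1 = (a_n - a) @ B_n @ (a_n - a)     # quadratic term in the error of a_n
S2 = 2 * (a_n - a) @ B_n @ a         # cross term
S3 = a @ (B_n - B) @ a               # term driven by the error of B_n
lhs = a_n @ B_n @ a_n - a @ B @ a
```

The symmetry of $B_n$ is what allows the two cross terms to be merged into the single term with the factor 2.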



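Note that

$$S_1 \le \|(P_n\Psi_j\Psi_j^T)^{-1}\| \cdot \|P_n\Psi_jY - E\Psi_jY\|^2. \tag{24}$$

By Lemma 4 and the union bound of probability,

$$P(\|P_n\Psi_jY - E\Psi_jY\|^2 \ge d_n\delta^2n^{-2}) \le 4d_n\exp\Big\{-\frac{\delta^2}{2(c_6nd_n^{-1} + c_7\delta)}\Big\}. \tag{25}$$

Recall the result in Lemma 5 that, for any given constant $c_4$, there exists a
positive constant $c_8$ such that

$$P\Big\{\Big|\|(P_n\Psi_j\Psi_j^T)^{-1}\| - \|(E\Psi_j\Psi_j^T)^{-1}\|\Big| \ge c_8\|(E\Psi_j\Psi_j^T)^{-1}\|\Big\} \le 2d_n^2\exp(-c_4nd_n^{-3}).$$

Since by Fact 3,

$$\|(E\Psi_j\Psi_j^T)^{-1}\| \le D_1^{-1}d_n,$$

it follows that

$$P\{\|(P_n\Psi_j\Psi_j^T)^{-1}\| \ge (c_8 + 1)D_1^{-1}d_n\} \le 2d_n^2\exp(-c_4nd_n^{-3}). \tag{26}$$

Combining (24)-(26) and the union bound of probability, we have

$$P(S_1 \ge (c_8 + 1)D_1^{-1}d_n^2\delta^2/n^2) \le 4d_n\exp\Big\{-\frac{\delta^2}{2(c_6nd_n^{-1} + c_7\delta)}\Big\} + 2d_n^2\exp(-c_4nd_n^{-3}). \tag{27}$$

To bound $S_2$, we note that

$$\begin{aligned}
|S_2| &\le 2\|P_n\Psi_jY - E\Psi_jY\| \cdot \|(P_n\Psi_j\Psi_j^T)^{-1}E\Psi_jY\| \\
&\le 2\|P_n\Psi_jY - E\Psi_jY\| \cdot \|(P_n\Psi_j\Psi_j^T)^{-1}\| \cdot \|E\Psi_jY\|.
\end{aligned} \tag{28}$$

Since by Condition D,

$$\|E\Psi_jY\|^2 = \sum_{k=1}^{d_n}(E\Psi_{jk}Y)^2 = \sum_{k=1}^{d_n}(E\Psi_{jk}m)^2 \le B_1^2\sum_{k=1}^{d_n}E\Psi_{jk}^2 \le B_1^2C_2, \tag{29}$$

it follows from (25), (26), (28), (29) and the union bound of probability that

$$P(|S_2| \ge 2(c_8 + 1)D_1^{-1}B_1C_2^{1/2}d_n^{3/2}\delta/n) \le 4d_n\exp\Big\{-\frac{\delta^2}{2(c_6nd_n^{-1} + c_7\delta)}\Big\} + 2d_n^2\exp(-c_4nd_n^{-3}). \tag{30}$$

Now we bound $S_3$. Note that

$$S_3 = (E\Psi_jY)^T(P_n\Psi_j\Psi_j^T)^{-1}(E - P_n)\Psi_j\Psi_j^T(E\Psi_j\Psi_j^T)^{-1}E\Psi_jY. \tag{31}$$

By the fact that $\|AB\| \le \|A\| \cdot \|B\|$, we have

$$|S_3| \le \|(P_n - E)\Psi_j\Psi_j^T\| \cdot \|(P_n\Psi_j\Psi_j^T)^{-1}\| \cdot \|(E\Psi_j\Psi_j^T)^{-1}\| \cdot \|E\Psi_jY\|^2. \tag{32}$$

For any $\|x\| = 1$ and $d_n$-dimensional square matrix $D$,

$$\|Dx\|^2 = \sum_{i=1}^{d_n}\Big(\sum_{l=1}^{d_n}D_{il}x_l\Big)^2 \le \|D\|_\infty^2\sum_{i=1}^{d_n}\Big(\sum_{l=1}^{d_n}|x_l|\Big)^2 \le d_n^2\|D\|_\infty^2.$$

Therefore, $\|D\| \le d_n\|D\|_\infty$. We conclude that

$$\|(P_n - E)\Psi_j\Psi_j^T\| \le d_n\|(P_n - E)\Psi_j\Psi_j^T\|_\infty. \tag{33}$$

By (20), (26), (29), (32), (33) and the union bound of probability, it follows
that

$$P(|S_3| \ge (c_8 + 1)D_1^{-2}B_1^2C_2d_n^3\delta/n) \le 2d_n^2\exp\Big\{-\frac{\delta^2}{2(c_6nd_n^{-1} + c_7\delta)}\Big\} + 2d_n^2\exp(-c_4nd_n^{-3}). \tag{34}$$

It follows from (23), (27), (30), (34) and the union bound of probability
that for some positive constants $c_{10}$, $c_{11}$ and $c_{12}$,

$$\begin{aligned}
&P\big(\big|\|\hat f_{nj}\|_n^2 - \|f_{nj}\|^2\big| \ge c_{10}d_n^2\delta^2/n^2 + c_{11}d_n^{3/2}\delta/n + c_{12}d_n^3\delta/n\big) \\
&\quad \le (8d_n + 2d_n^2)\exp\Big\{-\frac{\delta^2}{2(c_6nd_n^{-1} + c_7\delta)}\Big\} + 6d_n^2\exp(-c_4nd_n^{-3}).
\end{aligned} \tag{35}$$

In (35), let $c_{10}d_n^2\delta^2/n^2 + c_{11}d_n^{3/2}\delta/n + c_{12}d_n^3\delta/n = c_2d_nn^{-2\kappa}$ for any given
$c_2 > 0$, i.e., taking $\delta = n^{1-2\kappa}d_n^{-2}c_2/c_{12}$, there exist some positive constants $c_3$
and $c_4$ such that

$$P\big(\big|\|\hat f_{nj}\|_n^2 - \|f_{nj}\|^2\big| \ge c_2d_nn^{-2\kappa}\big) \le (8d_n + 2d_n^2)\exp(-c_3n^{1-4\kappa}d_n^{-3}) + 6d_n^2\exp(-c_4nd_n^{-3}).$$

The first part thus follows from the union bound of probability.
To prove the second part, note that on the event

$$A_n \equiv \Big\{\max_{j \in \mathcal{M}_\star}\big|\|\hat f_{nj}\|_n^2 - \|f_{nj}\|^2\big| \le c_1\xi d_nn^{-2\kappa}/2\Big\},$$

by Lemma 1, we have

$$\|\hat f_{nj}\|_n^2 \ge c_1\xi d_nn^{-2\kappa}/2, \text{ for all } j \in \mathcal{M}_\star. \tag{36}$$

Hence, by the choice of $\nu_n$, we have $\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\nu_n}$. The result now follows from
a simple union bound:

$$P(A_n^c) \le s_n\{(8d_n + 2d_n^2)\exp(-c_3n^{1-4\kappa}d_n^{-3}) + 6d_n^2\exp(-c_4nd_n^{-3})\}.$$

This completes the proof. □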

Proof of Theorem 2. The key idea of the proof is to show that

$$\|E\Psi Y\|^2 = O(\lambda_{\max}(\Sigma)). \tag{37}$$

If so, by definition and $\|\Psi_{jk}\|_\infty \le 1$, we have

$$\sum_{j=1}^{p_n}\|f_{nj}\|^2 \le \max_{1 \le j \le p_n}\lambda_{\max}\{(E\Psi_j\Psi_j^T)^{-1}\}\|E\Psi Y\|^2 = O(d_n\lambda_{\max}(\Sigma)).$$

This implies that the number of $\{j : \|f_{nj}\|^2 > \varepsilon d_nn^{-2\kappa}\}$ cannot exceed
$O(n^{2\kappa}\lambda_{\max}(\Sigma))$ for any $\varepsilon > 0$. Thus, on the set

$$B_n = \Big\{\max_{1 \le j \le p_n}\big|\|\hat f_{nj}\|_n^2 - \|f_{nj}\|^2\big| \le \varepsilon d_nn^{-2\kappa}\Big\},$$

the number of $\{j : \|\hat f_{nj}\|_n^2 > 2\varepsilon d_nn^{-2\kappa}\}$ cannot exceed the number of
$\{j : \|f_{nj}\|^2 > \varepsilon d_nn^{-2\kappa}\}$, which is bounded by $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$. By taking $\varepsilon = c_5/2$,
we have

$$P[|\widehat{\mathcal{M}}_{\nu_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}] \ge P(B_n).$$

The conclusion follows from Theorem 1(i).

It remains to prove (37). Note that (37) is more related to the joint
regression rather than the marginal regression. Let

$$\alpha_n = \operatorname{argmin}_{\alpha}E(Y - \Psi^T\alpha)^2,$$

which is the joint regression coefficients in the population. By the score
equation of $\alpha_n$, we get

$$E\Psi(Y - \Psi^T\alpha_n) = 0.$$

Hence

$$\|E\Psi Y\|^2 = \alpha_n^TE\Psi\Psi^TE\Psi\Psi^T\alpha_n \le \lambda_{\max}(\Sigma)\,\alpha_n^TE\Psi\Psi^T\alpha_n.$$

Now, it follows from the orthogonal decomposition that

$$\mathrm{var}(Y) = \mathrm{var}(\Psi^T\alpha_n) + \mathrm{var}(Y - \Psi^T\alpha_n).$$

Since $\mathrm{var}(Y) = O(1)$, we conclude that $\mathrm{var}(\Psi^T\alpha_n) = O(1)$, i.e.

$$\alpha_n^TE\Psi\Psi^T\alpha_n = O(1).$$

This completes the proof. □

A APPENDIX: Tables for Simulation Results
of Section 5.3



Table 4: Average values of the numbers of true (TP), false (FP) positives,
prediction error (PE), computation time (Time) for Example 6 (t = 0). Robust
standard deviations are given in parentheses.

SNR   d_n   Method   TP           FP             PE           Time
0.5   2     INIS     3.96(0.00)   2.28(1.49)     7.74(0.79)   16.09(5.32)
0.5   2     penGAM   4.00(0.00)   27.85(16.98)   8.07(0.92)   354.46(31.48)
0.5   4     INIS     3.93(0.00)   2.29(1.68)     7.90(0.81)   21.68(8.95)
0.5   4     penGAM   3.99(0.00)   25.61(13.62)   8.21(0.84)   421.17(35.71)
0.5   8     INIS     3.81(0.00)   2.59(2.24)     8.16(1.08)   33.10(15.79)
0.5   8     penGAM   3.95(0.00)   34.59(20.34)   8.49(0.82)   484.17(179.70)
0.5   16    INIS     3.38(0.75)   2.02(1.49)     8.60(1.13)   42.69(20.13)
0.5   16    penGAM   3.74(0.00)   33.48(23.88)   9.04(0.93)   685.97(267.43)
1.0   2     INIS     4.00(0.00)   2.16(2.24)     3.98(0.34)   16.03(5.74)
1.0   2     penGAM   4.00(0.00)   26.51(14.18)   4.20(0.46)   284.85(20.30)
1.0   4     INIS     4.00(0.00)   2.08(1.49)     3.97(0.45)   20.80(8.57)
1.0   4     penGAM   4.00(0.00)   28.33(15.49)   4.24(0.47)   362.02(81.43)
1.0   8     INIS     4.00(0.00)   2.72(2.24)     4.04(0.43)   35.79(18.38)
1.0   8     penGAM   4.00(0.00)   36.50(21.83)   4.37(0.47)   427.60(152.53)
1.0   16    INIS     4.00(0.00)   1.80(1.49)     4.26(0.45)   46.81(21.47)
1.0   16    penGAM   4.00(0.00)   38.60(19.78)   4.80(0.57)   595.87(197.06)
2.0   2     INIS     4.00(0.00)   2.03(2.24)     2.12(0.17)   15.92(5.42)
2.0   2     penGAM   4.00(0.00)   25.89(13.06)   2.25(0.24)   235.69(13.32)
2.0   4     INIS     4.00(0.00)   2.38(2.24)     2.06(0.22)   23.54(9.08)
2.0   4     penGAM   4.00(0.00)   30.37(17.16)   2.21(0.26)   341.13(19.44)
2.0   8     INIS     4.00(0.00)   2.79(2.24)     2.03(0.21)   38.56(19.58)
2.0   8     penGAM   4.00(0.00)   38.51(16.42)   2.24(0.26)   396.84(20.51)
2.0   16    INIS     4.00(0.00)   1.77(1.49)     2.17(0.25)   48.40(24.65)
2.0   16    penGAM   4.00(0.00)   42.58(16.60)   2.54(0.30)   540.89(165.39)
4.0   2     INIS     4.00(0.00)   2.06(2.24)     1.19(0.13)   17.74(6.42)
4.0   2     penGAM   4.00(0.00)   28.57(14.37)   1.27(0.15)   213.43(12.09)
4.0   4     INIS     4.00(0.00)   2.33(1.49)     1.09(0.10)   23.28(9.37)
4.0   4     penGAM   4.00(0.00)   30.75(17.35)   1.18(0.14)   300.69(12.21)
4.0   8     INIS     4.00(0.00)   2.88(2.24)     1.02(0.12)   39.21(19.17)
4.0   8     penGAM   4.00(0.00)   40.51(17.54)   1.14(0.14)   340.06(11.49)
4.0   16    INIS     4.00(0.00)   1.72(1.49)     1.10(0.12)   49.79(25.78)
4.0   16    penGAM   4.00(0.00)   45.77(19.03)   1.33(0.16)   481.19(141.51)

Table 5: Average values of the numbers of true (TP), false (FP) positives,
prediction error (PE), computation time (Time) for Example 6 (t = 1). Robust
standard deviations are given in parentheses.

SNR   d_n   Method   TP           FP             PE           Time
0.5   2     INIS     3.35(0.75)   33.67(8.96)    9.49(1.28)   196.87(91.48)
0.5   2     penGAM   3.10(0.00)   17.74(15.11)   7.92(0.89)   1107.78(385.95)
0.5   4     INIS     3.02(0.00)   20.22(2.43)    8.70(1.14)   109.51(56.11)
0.5   4     penGAM   2.78(0.00)   15.91(10.07)   7.99(0.91)   734.08(227.55)
0.5   8     INIS     2.51(0.75)   10.48(0.75)    8.37(0.89)   65.12(16.64)
0.5   8     penGAM   2.59(0.75)   16.47(9.70)    8.13(0.90)   624.31(56.23)
0.5   16    INIS     2.10(0.00)   4.47(0.75)     8.44(1.00)   46.84(15.61)
0.5   16    penGAM   2.41(0.75)   15.56(10.63)   8.42(0.97)   786.45(244.02)
1.0   2     INIS     3.83(0.00)   32.46(9.70)    4.86(0.60)   164.97(64.14)
1.0   2     penGAM   3.64(0.75)   24.61(21.08)   4.19(0.49)   849.23(294.03)
1.0   4     INIS     3.56(0.75)   20.53(1.68)    4.42(0.52)   118.14(43.97)
1.0   4     penGAM   3.46(0.75)   22.07(16.04)   4.18(0.49)   614.93(97.36)
1.0   8     INIS     3.09(0.00)   10.67(0.75)    4.28(0.49)   71.16(32.10)
1.0   8     penGAM   3.12(0.00)   19.92(10.63)   4.30(0.50)   548.60(33.88)
1.0   16    INIS     2.68(0.75)   4.18(0.75)     4.45(0.52)   46.08(15.35)
1.0   16    penGAM   2.95(0.00)   16.39(11.19)   4.57(0.55)   710.56(199.86)
2.0   2     INIS     3.99(0.00)   29.45(11.57)   2.55(0.38)   139.67(70.45)
2.0   2     penGAM   3.97(0.00)   36.57(22.57)   2.25(0.28)   626.84(210.44)
2.0   4     INIS     3.93(0.00)   19.12(3.73)    2.26(0.24)   111.01(21.82)
2.0   4     penGAM   3.91(0.00)   31.31(20.52)   2.19(0.23)   481.87(52.11)
2.0   8     INIS     3.50(0.75)   10.29(0.75)    2.21(0.23)   78.06(32.23)
2.0   8     penGAM   3.71(0.75)   27.06(19.03)   2.28(0.29)   448.38(26.63)
2.0   16    INIS     2.93(0.00)   4.07(0.00)     2.42(0.32)   51.69(1.10)
2.0   16    penGAM   3.22(0.00)   19.51(12.13)   2.53(0.30)   661.93(46.27)
4.0   2     INIS     4.00(0.00)   29.47(11.38)   1.45(0.21)   144.22(72.54)
4.0   2     penGAM   4.00(0.00)   37.27(20.71)   1.27(0.17)   533.98(69.29)
4.0   4     INIS     3.99(0.00)   17.36(5.22)    1.17(0.12)   102.97(32.71)
4.0   4     penGAM   4.00(0.00)   38.71(20.34)   1.16(0.11)   403.32(28.29)
4.0   8     INIS     3.78(0.00)   10.00(0.00)    1.13(0.16)   88.79(12.02)
4.0   8     penGAM   3.99(0.00)   41.42(15.86)   1.19(0.13)   402.92(16.94)
4.0   16    INIS     3.02(0.00)   3.98(0.00)     1.36(0.15)   49.13(1.85)
4.0   16    penGAM   3.72(0.75)   29.58(19.40)   1.43(0.18)   556.31(35.48)